Introduction

This project centers on predicting flight delays from a number of variables, with each entry in the data set being a specific flight from 2008. The arrival delay of a particular flight is the outcome we are interested in, with various factors known prior to arrival serving as predictors. The data includes flights from numerous carriers across a variety of geographical locations.

Loading Packages

These are the packages that we load for various model-building, plotting, and organizational tasks.

library(ISLR)
library(corrplot)  
library(discrim)  
library(corrr) 
library(kknn)
library(knitr)   
library(MASS)   
library(tidyverse)   
library(tidymodels)
library(ggplot2)   
library(ggrepel)
library(ggimage)
library(rpart.plot)
library(vip)         
library(vembedr)     
library(janitor)     
library(randomForest)  
library(stringr)   
library(dplyr)     
library(yardstick)
tidymodels_prefer()

Loading in our Data

flight_data <- read.csv('/Users/kerouac/Downloads/DelayedFlights.csv')
save(flight_data, file = '~/project_save_files/flight_data.rda')

We read in the csv file and then saved it as an .rda file so that we can load it later and save computing time.

load(file ='~/project_save_files/flight_data.rda' )

In order to get a clear picture of the data, let's go over the different variables and what is represented in the set. Each variable and a short description are listed below.

Year: 2008 for all flights

Month: 1-12 is January through December

DayOfMonth: 1-31 possible days of the month

DayOfWeek: 1 (Monday) through 7 (Sunday)

DepTime: Actual departure time in local time

CRSDepTime: Scheduled departure time

ArrTime: Actual arrival time in local time

CRSArrTime: Scheduled arrival time in local time

UniqueCarrier: Unique Carrier Code (airline)

FlightNum: Specific Flight Number

TailNum: Airplane Tail Number which is unique for each aircraft

ActualElapsedTime: Actual Elapsed Time in minutes

CRSElapsedTime: Scheduled elapsed time in minutes

AirTime: Time in the air in minutes

ArrDelay: Arrival delay, in minutes

DepDelay: Departure delay, in minutes

Origin: origin IATA airport code

Dest: destination IATA airport code

Distance: Distance of trip in miles

TaxiIn: taxi in time, in minutes

TaxiOut: taxi out time in minutes

Cancelled: Was the flight cancelled? (1 = yes, 0 = no)

CancellationCode: Reason for cancellation (A = carrier, B = weather, C = NAS, D = security)

Diverted: Was the flight diverted? (1 = yes, 0 = no)

CarrierDelay: Delay that is the fault of the carrier, in minutes. Examples include aircraft damage, cleaning of the aircraft, inspection, etc.

WeatherDelay: Delay caused by hazardous weather, in minutes

NASDelay: Delay caused by the National Airspace System (NAS), in minutes; generally operations at a particular airport, such as heavy traffic

SecurityDelay: Delay caused by security measures or events, in minutes

LateAircraftDelay: Arrival delay at an airport due to the late arrival of the same aircraft at a previous airport, in minutes.

Organizing the Data

dim(flight_data)
## [1] 1936758      30

Here we can see what we are working with in the data. We have 1,936,758 observations and 30 variables.

set.seed(1234)

carriers <- flight_data$UniqueCarrier

endeavor_air <- flight_data %>% 
  filter(UniqueCarrier == '9E')
american_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'AA')
aloha_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'AQ')
alaska_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'AS')
jetblue_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'B6')
continental_air <- flight_data %>% 
  filter(UniqueCarrier == 'CO')
delta_airlines<- flight_data %>% 
  filter(UniqueCarrier == 'DL')
expressjet_air <- flight_data %>% 
  filter(UniqueCarrier == 'EV')
frontier_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'F9')
airtran_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'FL')
hawaiian_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'HA')
envoy_airlines <-  flight_data %>% 
  filter(UniqueCarrier == 'MQ')
northwest_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'NW')
psa_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'OH')
skywest_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'OO')
united_airlines <- flight_data %>% 
  filter(UniqueCarrier == 'UA')
us_airways<- flight_data %>% 
  filter(UniqueCarrier == 'US')
southwest_airlines<- flight_data %>% 
  filter(UniqueCarrier == 'WN')
jsx_airlines<- flight_data %>% 
  filter(UniqueCarrier == 'XE')
mesa_airlines<- flight_data %>% 
  filter(UniqueCarrier == 'YV')

all_airlines <- rbind(
  sample_n(airtran_airlines, 500), sample_n(alaska_airlines, 500),
  sample_n(aloha_airlines, 500), sample_n(american_airlines, 500),
  sample_n(continental_air, 500), sample_n(delta_airlines, 500),
  sample_n(endeavor_air, 500), sample_n(envoy_airlines, 500),
  sample_n(expressjet_air, 500), sample_n(frontier_airlines, 500),
  sample_n(hawaiian_airlines, 500), sample_n(jetblue_airlines, 500),
  sample_n(jsx_airlines, 500), sample_n(mesa_airlines, 500),
  sample_n(northwest_airlines, 500), sample_n(psa_airlines, 500),
  sample_n(skywest_airlines, 500), sample_n(southwest_airlines, 500),
  sample_n(united_airlines, 500), sample_n(us_airways, 500)
)

all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'UA'] <- 'United'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'US'] <- 'US Airways'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'WN'] <- 'Southwest'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'XE'] <- 'JSX'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'AS'] <- 'Alaska'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'OO'] <- 'Skywest'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'FL'] <- 'Airtran'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'DL'] <- 'Delta'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'CO'] <- 'Continental'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'B6'] <- 'JetBlue'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'HA'] <- 'Hawaiian'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'AA'] <- 'American'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == '9E'] <- 'Endeavor'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'EV'] <- 'ExpressJet'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'F9'] <- 'Frontier'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'OH'] <- 'PSA'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'YV'] <- 'Mesa'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'NW'] <- 'Northwest'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'AQ'] <- 'Aloha'
all_airlines$UniqueCarrier[all_airlines$UniqueCarrier == 'MQ'] <- 'Envoy'
Our data set is quite large, so to make it easier to work with we first extracted the data for each individual airline. We then took a sample of 500 flights from each airline so that our data is evenly distributed across airlines, and renamed the carrier codes to their respective airline names to make the data easier to understand.
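The per-carrier filter and recode blocks above can also be written much more compactly. Here is a hedged sketch of an equivalent approach using dplyr's slice_sample() and a named lookup vector; carrier_names and all_airlines_alt are our own names, not part of the original code, and because the sampling order differs this is not guaranteed to draw the identical rows under the same seed:

```r
# Lookup vector mapping carrier codes to airline names
carrier_names <- c(
  '9E' = 'Endeavor',  'AA' = 'American',   'AQ' = 'Aloha',
  'AS' = 'Alaska',    'B6' = 'JetBlue',    'CO' = 'Continental',
  'DL' = 'Delta',     'EV' = 'ExpressJet', 'F9' = 'Frontier',
  'FL' = 'Airtran',   'HA' = 'Hawaiian',   'MQ' = 'Envoy',
  'NW' = 'Northwest', 'OH' = 'PSA',        'OO' = 'Skywest',
  'UA' = 'United',    'US' = 'US Airways', 'WN' = 'Southwest',
  'XE' = 'JSX',       'YV' = 'Mesa'
)

all_airlines_alt <- flight_data %>%
  group_by(UniqueCarrier) %>%   # one group per carrier code
  slice_sample(n = 500) %>%     # 500 flights from each carrier
  ungroup() %>%
  mutate(UniqueCarrier = unname(carrier_names[UniqueCarrier]))  # recode
```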

Exploratory Data Analysis

unique(all_airlines$UniqueCarrier)
##  [1] "Airtran"     "Alaska"      "Aloha"       "American"    "Continental"
##  [6] "Delta"       "Endeavor"    "Envoy"       "ExpressJet"  "Frontier"   
## [11] "Hawaiian"    "JetBlue"     "JSX"         "Mesa"        "Northwest"  
## [16] "PSA"         "Skywest"     "Southwest"   "United"      "US Airways"

These are the various carriers operating the flights. There are 20 unique carriers.

unique(all_airlines$Origin)
##   [1] "MEM" "FNT" "SEA" "EWR" "ATL" "BWI" "DAB" "PNS" "RSW" "PBI" "BUF" "MLI"
##  [13] "MCO" "SJU" "IAD" "MDW" "HPN" "DFW" "LGA" "MKE" "SAN" "IND" "FLL" "ROC"
##  [25] "CAK" "PIT" "DTW" "MSY" "PHF" "LAS" "CLT" "LAX" "MIA" "BOS" "DAY" "PHL"
##  [37] "SRQ" "MCI" "RDU" "TPA" "JAX" "DEN" "SFO" "STL" "DCA" "BMI" "PHX" "CHS"
##  [49] "SWF" "MSP" "SAV" "GPT" "SAT" "RIC" "BET" "ANC" "LGB" "JNU" "AKN" "GEG"
##  [61] "FAI" "ONT" "PSP" "PDX" "OTZ" "BRW" "OAK" "KTN" "SJC" "SIT" "ORD" "YAK"
##  [73] "DLG" "SCC" "BUR" "SNA" "SMF" "ADQ" "OME" "PSG" "HNL" "KOA" "LIH" "OGG"
##  [85] "ITO" "RNO" "MFE" "AUS" "JFK" "ELP" "TUS" "IAH" "SDF" "TUL" "XNA" "OKC"
##  [97] "STT" "HSV" "COS" "BNA" "BDL" "ICT" "EGE" "ABQ" "CLE" "HDN" "BHM" "CMH"
## [109] "SLC" "CVG" "ORF" "HOU" "GSO" "JAC" "TVC" "DSM" "JAN" "TLH" "OMA" "BTV"
## [121] "GSP" "LEX" "TYS" "DLH" "LIT" "BGR" "ERI" "GTF" "SBN" "PWM" "PIA" "SHV"
## [133] "ABE" "MGM" "HLN" "GRB" "AZO" "LAN" "MOB" "CWA" "IDA" "FSD" "CAE" "FWA"
## [145] "GRR" "STX" "SGF" "BIS" "CID" "PLN" "ELM" "EVV" "PFN" "MSN" "MYR" "MDT"
## [157] "AVP" "MAF" "ACT" "RST" "GRK" "LBB" "SBA" "SYR" "VPS" "LAW" "FSM" "LRD"
## [169] "TOL" "ABI" "BTR" "CHA" "LSE" "MLU" "DAL" "CMI" "MQT" "SPS" "DBQ" "GGG"
## [181] "ROW" "LFT" "PVD" "SJT" "CLL" "ATW" "ALB" "MHT" "FAY" "CRW" "AVL" "VLD"
## [193] "OAJ" "CSG" "TRI" "EWN" "AGS" "ROA" "MLB" "MCN" "ILM" "EYW" "GNV" "ABY"
## [205] "BQK" "AEX" "LYH" "GTR" "BOI" "PSE" "BQN" "AMA" "HRL" "LCH" "ACK" "MRY"
## [217] "TEX" "YUM" "BFL" "SBP" "SPI" "RKS" "ASE" "FLG" "COD" "GCC" "FAT" "DRO"
## [229] "RAP" "MTJ" "GJT" "GFK" "FCA" "FAR" "MOT" "PSC" "ACV" "MSO" "RDM" "SGU"
## [241] "SMX" "MOD" "BTM" "LNK" "MFR" "BZN" "CPR" "EUG" "RDD" "CLD" "SUN" "CIC"
## [253] "LMT" "BIL" "CEC" "TWF" "ISP" "CRP"
length(unique(all_airlines$Origin))
## [1] 258

These are the various origins by their IATA code. We have 258 unique origins.

unique(all_airlines$Dest)
##   [1] "ATL" "TPA" "BWI" "SRQ" "RSW" "IND" "MDW" "ROC" "MSP" "IAD" "MCO" "FNT"
##  [13] "MLI" "PHX" "PBI" "LAS" "DCA" "SFO" "MKE" "PNS" "EWR" "FLL" "LGA" "CHS"
##  [25] "PHL" "BOS" "MCI" "PHF" "JAX" "HOU" "RIC" "CAK" "CLT" "PIT" "BUF" "RDU"
##  [37] "LAX" "DFW" "STL" "DAY" "DTW" "SJU" "HPN" "ICT" "SAN" "BMI" "MSY" "SWF"
##  [49] "GPT" "MEM" "DAB" "DEN" "SEA" "MIA" "SAT" "PWM" "CMH" "ANC" "PDX" "YAK"
##  [61] "OTZ" "SNA" "PSP" "OAK" "OME" "KTN" "SIT" "SMF" "BRW" "GEG" "FAI" "TUS"
##  [73] "WRG" "PSG" "LGB" "AKN" "JNU" "ORD" "ONT" "LIH" "CDV" "SJC" "SCC" "BET"
##  [85] "HNL" "BUR" "DLG" "ITO" "RNO" "OGG" "KOA" "COS" "OMA" "MFE" "TUL" "JFK"
##  [97] "ELP" "BNA" "ABQ" "SLC" "AUS" "MTJ" "OKC" "HSV" "IAH" "STT" "BDL" "CLE"
## [109] "BTR" "BQN" "CVG" "SAV" "BHM" "JAC" "MLB" "TYS" "CAE" "ORF" "VPS" "ERI"
## [121] "PIA" "CID" "LEX" "XNA" "SCE" "MOB" "SDF" "MBS" "AVP" "SGF" "AVL" "JAN"
## [133] "SBN" "DSM" "FSD" "ROA" "LIT" "GFK" "DLH" "FAR" "TLH" "PFN" "SHV" "ELM"
## [145] "MSN" "SYR" "GRR" "GSP" "GSO" "ABE" "TVC" "BZN" "AZO" "LAN" "BTV" "BGM"
## [157] "ATW" "FWA" "BIS" "MGM" "HLN" "CHA" "LNK" "CMX" "MDT" "MYR" "MQT" "ABI"
## [169] "GGG" "PVD" "AEX" "AMA" "LFT" "LBB" "FSM" "TYR" "SJT" "CLL" "GRK" "TXK"
## [181] "ACT" "LRD" "ALB" "CMI" "GJT" "DAL" "ROW" "LSE" "DBQ" "CWA" "GRB" "RST"
## [193] "SBA" "TOL" "CRP" "SPS" "EVV" "MAF" "FAT" "LAW" "CHO" "TRI" "GNV" "GTR"
## [205] "OAJ" "CRW" "DHN" "EYW" "FAY" "ABY" "MLU" "FLO" "AGS" "HHH" "VLD" "ILM"
## [217] "CSG" "EWN" "MHT" "BOI" "PSE" "BPT" "BRO" "HRL" "BFL" "BGR" "IDA" "MRY"
## [229] "ACK" "DRO" "SBP" "CPR" "YUM" "ASE" "GUC" "EUG" "EGE" "SPI" "MFR" "RAP"
## [241] "BIL" "MSO" "SUN" "CEC" "RDD" "IPL" "CIC" "HDN" "RFD" "RDM" "CLD" "PMD"
## [253] "PSC" "TWF" "MOD" "GTF" "OXR" "SMX" "FCA" "OTH" "ACV" "BTM" "SLE" "ISP"
length(unique(all_airlines$Dest))
## [1] 264

These are the various destinations by their IATA code. There are 264 unique destinations.

all_airlines_full_values <- drop_na(all_airlines)
all_airlines_full_values$Month <- as.factor(all_airlines_full_values$Month)
all_airlines_full_values$DayOfWeek <- as.factor(all_airlines_full_values$DayOfWeek)

Here we drop the NA values from our data frame because these values do not help with our models. Additionally, it makes sense to change Month and Day of Week into factors because there are a set number of levels for each (12 months in a year, 7 days in a week).

ggplot(all_airlines_full_values)+ geom_boxplot(aes(y = ArrDelay, x = ordered(Month, levels = c(1:12))), colour = 'Blue', fill = 'Orange')+labs(title = "Delays By Month",
              subtitle = 'Boxplot',
              x = "Month", y = "Delay")

Here is a box plot that shows how delays are distributed throughout the year by month. We can get a general idea of the mean and variance for each month from this plot.

by_month_dat <- all_airlines_full_values %>% 
  group_by(month = Month) %>% 
  summarize(ArrDelay = mean(ArrDelay))

ggplot(by_month_dat)+
  geom_col(aes(x =month, y = ArrDelay), fill ='blue')

This plot shows the mean Arrival Delay based upon the month of the year which shows that there isn’t a great amount of variation from month to month.

by_month_weather <- all_airlines_full_values %>% 
  group_by(month = Month) %>% 
  summarize(WeatherDelay = mean(WeatherDelay))

ggplot(by_month_weather)+
  geom_col(aes(x =month, y = WeatherDelay), fill = 'forestgreen')

This plot shows the average Weather Delay based upon the month of the year, with some months seeing more of a weather delay than others.

various_origins <- all_airlines_full_values %>% 
  group_by(Origin)

origin_delay_flights <- various_origins %>% 
  summarise(avg_delay = mean(ArrDelay),
            num_of_flights = length(unique(FlightNum)))

We can examine the number of flights for each origin and the mean delay for that origin.

origins_of_interest <- origin_delay_flights %>% 
  filter(num_of_flights>50) %>% 
  arrange(-avg_delay)

ggplot(origins_of_interest)+
  geom_col(aes(x = Origin, y = avg_delay), fill = 'red4')+
  labs(y = 'Average Delay', title = 'Average Delay of Popular Airports')

Here we check the origins with the greatest delays, restricted to origins with more than 50 entries. These are among the most popular airports in our data set and give insight into the distribution of delays. Baltimore-Washington, JFK, and Detroit Metropolitan are the top three airports with the greatest average delay.

group_by_uniquecarrier <- all_airlines_full_values %>% 
  group_by(UniqueCarrier)

carrier_graphing_data <- group_by_uniquecarrier %>% 
  summarise(avg_delay = mean(ArrDelay))

ggplot(carrier_graphing_data[1:10, ]) +
  geom_col(aes(x = UniqueCarrier, y = avg_delay), fill = 'maroon3')+
  labs(x = 'Unique Carrier', y = "Average Delay (Minutes)")

ggplot(carrier_graphing_data[11:20, ]) +
  geom_col(aes(x = UniqueCarrier, y = avg_delay), fill = 'maroon3')+
  labs(x = 'Unique Carrier', y = "Average Delay (Minutes)")

The graphs are split up because it would be too difficult to read the names of each airline on the x-axis if they were all on one plot. JetBlue and Endeavor are the two airlines with the greatest average delay, both being over 40 minutes.

set.seed(1234)

flight_split <- initial_split(all_airlines_full_values, prop = 0.80,
                                strata = ArrDelay)
flight_train <- training(flight_split)
flight_test <- testing(flight_split)

We split the data here so that we have both a training set to train our models and a testing set to verify that our models work well outside of the training data. A proportion of 0.8 was used for the split because we have a large number of observations, so the testing set will still be sufficiently large. We stratify on our outcome variable of interest, Arrival Delay.

flight_folds <- vfold_cv(flight_train, v = 10)

Flight folds will be used for cross-validation. Rather than having just a training and testing set, we will have multiple folds; each fold is used in turn as a validation set while the remaining folds are used for training. This gives us more reliable estimates when fitting our models.

Model Building and Understanding Results

flight_recipe <- recipe(ArrDelay ~ UniqueCarrier + Month + CRSDepTime + CRSArrTime + CRSElapsedTime + Distance + TaxiIn + TaxiOut + DepDelay, data = flight_train) %>% 
  step_center(all_numeric_predictors()) %>% 
  step_scale(all_numeric_predictors()) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.8) %>% 
  step_dummy(all_nominal_predictors()) 

Our aim with this recipe is to be able to predict the Arrival Delay of a flight from various factors that come before the arrival. These various factors include: Unique Carrier, Month, Scheduled Departure Time, Scheduled Arrival Time, Scheduled Elapsed Time, Distance of flight, Taxi In and Taxi Out time, and overall Departure Delay.
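To sanity-check what this recipe does to the predictors, we can prep() it on the training data and bake() it. A minimal sketch, assuming the recipe and training set above; prepped_recipe and baked_train are our own names:

```r
# Estimate the recipe steps (centering/scaling statistics, correlation
# filter, dummy encodings) from the training data
prepped_recipe <- prep(flight_recipe, training = flight_train)

# Apply the estimated steps back to the training set and inspect the
# processed columns that the models will actually see
baked_train <- bake(prepped_recipe, new_data = NULL)
glimpse(baked_train)
```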

Linear Regression Model

With this following linear regression model we add in our recipe, with Arrival Delay being the numeric outcome. We create our workflow and then fit the model to the training data. Let’s see what our results are.

lm_model <- linear_reg() %>%
  set_mode('regression') %>% 
  set_engine("lm")

lm_wflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(flight_recipe)



linear_fit <- fit(lm_wflow, data = flight_train)

linear_fit %>%
  
  extract_fit_parsnip() %>%
  
  tidy()
## # A tibble: 38 × 5
##    term                  estimate std.error statistic   p.value
##    <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)          -1.41      1.30        -1.09  2.77e-  1
##  2 CRSDepTime           -0.00104   0.000579    -1.79  7.34e-  2
##  3 CRSArrTime           -0.000973  0.000531    -1.83  6.68e-  2
##  4 CRSElapsedTime       -0.294     0.0138     -21.4   3.70e- 97
##  5 Distance              0.0336    0.00167     20.1   1.75e- 86
##  6 TaxiIn                0.941     0.0309      30.5   1.62e-187
##  7 TaxiOut               0.890     0.0109      81.4   0        
##  8 DepDelay              0.976     0.00268    364.    0        
##  9 UniqueCarrier_Alaska  0.286     1.09         0.262 7.94e-  1
## 10 UniqueCarrier_Aloha   0.217     1.27         0.171 8.64e-  1
## # … with 28 more rows
predictions_linear <- predict(linear_fit, new_data = flight_train)

flight_metrics <- metric_set(rmse, rsq, mae)
flight_metrics(predictions_linear, truth = flight_train$ArrDelay, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      11.9  
## 2 rsq     standard       0.965
## 3 mae     standard       7.95

After using our model to predict the training data and evaluating our metrics, we can see that our RMSE (root mean squared error) is 11.9 and our rsq (r-squared) value is 0.965. Our mae (mean absolute error) is 7.95. Generally this would indicate that our errors are not ideal, but the model explains the variation in the data well, as shown by our rsq value.

Ridge Regression

ridge_recipe <- 
  flight_recipe %>% 
  step_novel(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())



ridge_spec <- linear_reg(mixture = 0, penalty = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")


ridge_wflow <- workflow() %>% 
  add_model(ridge_spec) %>% 
  add_recipe(ridge_recipe)



penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)



ridge_tune_res <- tune_grid(
  ridge_wflow,
  resamples = flight_folds, 
  grid = penalty_grid
)



best_penalty <- select_best(ridge_tune_res, metric = "rmse")
best_penalty
## # A tibble: 1 × 2
##   penalty .config              
##     <dbl> <chr>                
## 1 0.00001 Preprocessor1_Model01
ridge_final <- finalize_workflow(ridge_wflow, best_penalty)
ridge_final_fit <- fit(ridge_final, data = flight_train)


augment(ridge_final_fit, new_data = flight_test) %>%
  metrics(truth = ArrDelay, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      12.9  
## 2 rsq     standard       0.967
## 3 mae     standard       8.94
autoplot(ridge_tune_res)

Our ridge regression was tuned over penalties ranging from 10^-5 to 10^5 (the range of -5 to 5 is on a log10 scale), using resampling from our flight folds. The best penalty by RMSE was 1e-05, from Preprocessor1_Model01. After finalizing the workflow and fitting it on the training data, the testing-set metrics were an RMSE of 12.9, an rsq of 0.967, and an mae of 8.94.
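To visualize the shrinkage that the ridge penalty applies, we can pull the underlying glmnet object out of the fitted workflow and plot its coefficient paths. A minimal sketch, assuming ridge_final_fit from above; ridge_engine_fit is our own name:

```r
# Extract the raw glmnet fit from the finalized workflow, then plot how
# each coefficient shrinks toward zero as the penalty (lambda) grows
ridge_engine_fit <- extract_fit_engine(ridge_final_fit)
plot(ridge_engine_fit, xvar = "lambda", label = TRUE)
```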

K Nearest Neighbors

knn_spec <- 
  nearest_neighbor(
    neighbors = tune(),
    mode = "regression") %>% 
  set_engine("kknn")


knn_wkflow <- workflow() %>% 
  add_model(knn_spec) %>% 
  add_recipe(flight_recipe)


knn_parameters<- parameters(knn_spec)



knn_grid <- grid_regular(knn_parameters, levels = 2)


tuned_neighbors <- knn_wkflow %>% 
  tune_grid(
    resamples = flight_folds, 
            grid = knn_grid)



save(tuned_neighbors, file = '~/project_save_files/tuned_neighbors_model.rda')
load(file = '~/project_save_files/tuned_neighbors_model.rda')

autoplot(tuned_neighbors)

show_best(tuned_neighbors, metric = 'rmse')
## # A tibble: 2 × 7
##   neighbors .metric .estimator  mean     n std_err .config             
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1         1 rmse    standard    42.3    10   0.798 Preprocessor1_Model1
## 2        15 rmse    standard    43.2    10   1.08  Preprocessor1_Model2

We can see that the RMSE for the tuned K Nearest Neighbors model is much higher than for our other models; with a grid of two levels, only 1 and 15 neighbors were tried, and 1 neighbor did best with an RMSE of 42.3.
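With only two grid levels, just two values of k were explored, so a finer grid might do somewhat better. A hedged sketch of such a search, reusing knn_wkflow and flight_folds from above (this was not run as part of the original analysis, so we make no claim about its results):

```r
# Try 10 values of k between 1 and 50 instead of just two
finer_knn_grid <- grid_regular(neighbors(range = c(1, 50)), levels = 10)

tuned_neighbors_fine <- tune_grid(knn_wkflow,
                                  resamples = flight_folds,
                                  grid = finer_knn_grid)

show_best(tuned_neighbors_fine, metric = 'rmse')
```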

Random Forest and Boosted Trees

For the following models, we will save the results and then load them from their respective files to save time when computing. The analysis will be done after loading the files in.

The random forest model has several important arguments. \(\textbf{mtry}\) is essentially a measurement of how many predictors will be randomly sampled each time there is a split in the tree models; ours had a range of 2-12. \(\textbf{trees}\) is simply the number of trees contained in the tuned model; our range was from 90-160. \(\textbf{min_n}\) is the minimum number of data points contained in a node in order for it to be split again, with our range being from 10-150. We will select the values that work best for these parameters using tuning.

r_forest_model <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")

r_forest_wflow <- workflow() %>%
  add_model(r_forest_model %>% set_args(mtry = tune(),
                            trees = tune(),
                            min_n = tune())) %>%
  add_recipe(flight_recipe)

forest_grid <- grid_regular(mtry(range = c(2, 12)), 
                                            trees(range = c(90, 160)),
                                            min_n(range = c(10, 150)), levels = 8)


rf_tune_res <- tune_grid(r_forest_wflow, resamples = flight_folds, grid = forest_grid, metrics = metric_set(rsq, rmse))


save(rf_tune_res, file = '~/project_save_files/flight_randf.rda')
save(r_forest_wflow, file = '~/project_save_files/r_forest_wflow.rda')

For our boosted tree, \(\textbf{mtry}\) ranged from 2-20, \(\textbf{trees}\) ranged from 100-190, and \(\textbf{min_n}\) ranged from 10-150. We also tune a learn rate, which in our case ranges from -2 to 0.5 on a log10 scale.

boost_spec <- boost_tree(trees = tune(),
                         mtry = tune(),
                         min_n = tune(),
                         learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

boost_wflow <- workflow() %>% 
  add_recipe(flight_recipe) %>% 
  add_model(boost_spec)


boost_grid <- grid_regular(mtry(range = c(2, 20)), 
                                            trees(range = c(100, 190)),
                                            min_n(range = c(10, 150)), learn_rate(range = c(-2,0.5)))

boost_tune_res <- tune_grid(boost_wflow, resamples = flight_folds, grid = boost_grid, metrics = metric_set(rsq, rmse))





save(boost_tune_res, file ='~/project_save_files/boost_tune_res.rda')
load(file = '/Users/kerouac/project_save_files/boost_tune_res.rda')
load(file = '/Users/kerouac/project_save_files/flight_randf.rda')
load(file = '/Users/kerouac/project_save_files/r_forest_wflow.rda')

Let’s Analyze the random forest results

show_best(rf_tune_res, metric = 'rmse')
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    12   110    10 rmse    standard    9.35    10   0.180 Preprocessor1_Model0…
## 2    12   130    10 rmse    standard    9.37    10   0.177 Preprocessor1_Model0…
## 3    12   140    10 rmse    standard    9.39    10   0.175 Preprocessor1_Model0…
## 4    12   160    10 rmse    standard    9.39    10   0.171 Preprocessor1_Model0…
## 5    12   150    10 rmse    standard    9.39    10   0.177 Preprocessor1_Model0…
show_best(rf_tune_res, metric = 'rsq')
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    12   110    10 rsq     standard   0.718    10  0.0105 Preprocessor1_Model0…
## 2    12   130    10 rsq     standard   0.717    10  0.0104 Preprocessor1_Model0…
## 3    12   160    10 rsq     standard   0.716    10  0.0101 Preprocessor1_Model0…
## 4    12   140    10 rsq     standard   0.716    10  0.0103 Preprocessor1_Model0…
## 5    12   150    10 rsq     standard   0.716    10  0.0103 Preprocessor1_Model0…

Our tuned random forest yielded the best model (Preprocessor1_Model024) with an RMSE of 9.35 and an R squared value of 0.718. This came from an \(\textbf{mtry}\) value of 12, a \(\textbf{trees}\) value of 110, and a \(\textbf{min_n}\) value of 10.

autoplot(rf_tune_res)

We can see the general trend that as the number of selected predictors increases, the RMSE decreases. Since this forest was created with a large number of trees for each model, the effect of the number of trees is less pronounced. With that said, around 110 trees yielded the best results.
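Since the vip package is already loaded, one way to see which predictors drive the forest is to refit the best configuration with impurity-based importance recording turned on. A minimal sketch; the importance engine argument and the rf_imp_* names are our additions, not part of the original fits:

```r
# Refit the best random forest (mtry = 12, trees = 110, min_n = 10)
# with variable importance recording enabled in ranger
rf_imp_model <- rand_forest(mtry = 12, trees = 110, min_n = 10) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

rf_imp_fit <- workflow() %>%
  add_model(rf_imp_model) %>%
  add_recipe(flight_recipe) %>%
  fit(data = flight_train)

# Plot the most influential predictors
rf_imp_fit %>%
  extract_fit_parsnip() %>%
  vip()
```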

Now to look at the boosted tree model

show_best(boost_tune_res, metric = 'rmse') 
## # A tibble: 5 × 10
##    mtry trees min_n learn_rate .metric .estimator  mean     n std_err .config   
##   <int> <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>     
## 1    11   100    10      0.178 rmse    standard    16.5    10    1.96 Preproces…
## 2    11   145    10      0.178 rmse    standard    16.6    10    1.97 Preproces…
## 3    11   190    10      0.178 rmse    standard    16.7    10    1.99 Preproces…
## 4    20   100    10      0.178 rmse    standard    16.7    10    2.25 Preproces…
## 5    20   145    10      0.178 rmse    standard    16.9    10    2.28 Preproces…
show_best(boost_tune_res, metric ='rsq')
## # A tibble: 5 × 10
##    mtry trees min_n learn_rate .metric .estimator  mean     n std_err .config   
##   <int> <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>     
## 1    11   100    10      0.178 rsq     standard   0.934    10  0.0107 Preproces…
## 2    11   145    10      0.178 rsq     standard   0.933    10  0.0109 Preproces…
## 3    20   100    10      0.178 rsq     standard   0.932    10  0.0135 Preproces…
## 4    11   190    10      0.178 rsq     standard   0.932    10  0.0111 Preproces…
## 5    20   190    10      0.178 rsq     standard   0.931    10  0.0140 Preproces…
autoplot(boost_tune_res)

From our boosted tree model we can see that an \(\textbf{mtry}\) value of 11, a \(\textbf{trees}\) value of 100, a learn rate of 0.178, and a \(\textbf{min_n}\) value of 10 gave us our best metrics. The best rsq value was 0.934 and the best RMSE was 16.5.

Best Model Performance on Testing Data

From the various models that we created, the overall best metrics were seen with the linear regression model and the Random Forest Model in terms of RMSE. There might have been a degree of overfitting so we will look at both. Let’s visualize the performance in comparison to the testing set further.

Linear Regression Testing

delay_predict <- predict(linear_fit,  
                              new_data = flight_test)

delay_prediction_compare <- delay_predict %>%
  bind_cols(Actual = flight_test$ArrDelay) 

delay_prediction_compare
## # A tibble: 1,295 × 2
##    .pred Actual
##    <dbl>  <dbl>
##  1  16.1     19
##  2  77.0     71
##  3  29.0     19
##  4 146.     157
##  5  30.2     21
##  6  21.9     24
##  7  20.9     15
##  8 103.     100
##  9  54.4     49
## 10  22.0     15
## # … with 1,285 more rows

Here is a quick look at predicted versus actual values.

ggplot(delay_prediction_compare,                                     
       aes(x = .pred,
           y = Actual)) +
  geom_point() +
  xlim(0,400)+
  ylim(0,400)+
  geom_abline(intercept = 0,
              slope = 1,
              color = "red3",
              size = 1)+
  labs(title = 'Linear Regression Model',x = 'Predicted')

Here is a graphical representation of how the predicted values compare to the actual values.

linear_metrics <- metric_set(rmse, rsq, mae)
linear_metrics(delay_predict, truth = flight_test$ArrDelay, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      10.8  
## 2 rsq     standard       0.971
## 3 mae     standard       7.46

Our metrics for the linear model on the testing data were even slightly better than on the training data, which means our model fits the entirety of the data quite well. The RMSE here is 10.8, rsq is 0.971, and mae is 7.46.

Random Forest Testing

rf_final_wf <- r_forest_wflow %>%
  finalize_workflow(select_best(rf_tune_res, metric = "rmse"))

results_of_final_wf <- fit(rf_final_wf, flight_train)




best_forest_predict <- predict(results_of_final_wf,  
                              new_data = flight_test)

delay_prediction_compare_rf <- best_forest_predict %>%
  bind_cols(Actual = flight_test$ArrDelay)

delay_prediction_compare_rf
## # A tibble: 1,295 × 2
##    .pred Actual
##    <dbl>  <dbl>
##  1  25.2     19
##  2  85.3     71
##  3  28.4     19
##  4 136.     157
##  5  27.9     21
##  6  21.0     24
##  7  26.0     15
##  8  94.9    100
##  9  51.3     49
## 10  22.3     15
## # … with 1,285 more rows

This tibble shows the predictions versus actual values.

ggplot(delay_prediction_compare_rf,                                     
       aes(x = .pred,
           y = Actual)) +
  geom_point() +
  xlim(0,400)+
  ylim(0,400)+
  geom_abline(intercept = 0,
              slope = 1,
              color = "red3",
              size = 1)+
  labs(title = 'Random Forest Model',x = 'Predicted')

This is the same actual versus predicted graph but for the random forest

rf_metrics <- metric_set(rmse, rsq, mae)
rf_metrics(best_forest_predict, truth = flight_test$ArrDelay, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      19.3  
## 2 rsq     standard       0.918
## 3 mae     standard       9.54

Our RMSE value here is 19.3, which is much higher than it was for the training data. The rsq is 0.918 and mae is 9.54.
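To put the testing-set results side by side, the metrics reported above can be collected into one small table (the numbers are transcribed from the outputs earlier in this report; the ridge figures come from the augment() step in its section):

```r
# Test-set metrics for the three models evaluated on flight_test,
# copied from the outputs shown above
test_metrics_summary <- tibble::tribble(
  ~model,              ~rmse,  ~rsq,  ~mae,
  "Linear regression",  10.8, 0.971,  7.46,
  "Ridge regression",   12.9, 0.967,  8.94,
  "Random forest",      19.3, 0.918,  9.54
)
test_metrics_summary
```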

Conclusion

In conclusion, the models we created all had unique results. Initially, the data seemed somewhat difficult to fit models to, given the large number of observations and variables. After shrinking the data set down and cleaning it, model fitting became much easier. This highlights the importance of becoming familiar with the data prior to implementing models.

Ultimately, both the random forest model and the linear regression model had good metrics when creating them. We aimed to see how various factors prior to arrival can affect arrival delay, and these two models gave us the best results compared to the others. However, once we fit the models to the testing data, the regression model performed better indicating that there might have been some overfitting with the random forest.

The idea behind this project was fascinating, and there are some possible improvements that could be made in similar projects. Firstly, other variables and data that can predict delays would potentially allow for even more interesting models; a data set with more obscure features that predict delays well would be fascinating to explore. Secondly, perhaps more flexible methods such as neural networks would be effective for this type of data.

That's all, folks.